What does this error mean?
"Recovery Manager : Log file is full" are informational message and it is related to our Transaction Logging,
It says that Transaction Logging are unable to write or re-use the extents in a timely fashion way, but there are couple of reasons behind
this informational messages.
1. If the storage of your Transaction Logging files are almost full
2. If there are Long Held Locks prior on experiencing the Log File is Full informational messages
3. If we are experiencing underlying performance issue which may cause on delays on writing to our Transaction Logging.
What is the impact?
When Log File is Full messages start to appear, the situation needs attention: from this point the server can still read from databases, but it cannot update any of them, which eventually leads to a hang. The server becomes inaccessible to users. If it is a mail server, mail flow stops; if it is an application server, collaboration and transactions stop.
What to look for when the error occurs?
Log File is Full is an informational message indicating that the transaction log (TXN) file is full. Note that the log file referred to here is different from log.nsf. The transaction log sequentially records all changes and updates to the databases, including the errors they incur; if those errors keep recurring and are not fixed, they accumulate and cause the TXN log file to become full. The way to resolve this is to find out why those errors occur in the first place.
When this error starts, long held locks have typically been seen appearing on databases beforehand.
What this "long held lock" mean?
The "long held lock" error messages relating to the database are written by the lock manager.
Lock manager's sole purpose is to allow access to a database( that has transaction logging enabled) by other processes or users while, at the same time, allowing the transaction logging functionality to write to or retrieve data from a particular view within that database.
When lock manager first accesses the database, it needs to get that view from disk. It thus sends a request to the OS during which time the OS under normal circumstances fulfills that request and the transaction logging function writes or commits data to that database. However, if the disk request is not fulfilled and there are other processes which require access to the same data that lock manager has locked, messages as seen above begin to appear. The longer that data is held and the more processes/threads that need that data at that instance of time, the more messages will appear. These messages stop once the lock manager has released itself from that view. In other words, if a queued disk request is not fulfilled while other processes require access, then we loop through the lock manager function and print out the process and thread id for each process.
For example, you will see the following kind of messages in the server console log file:
[18741:07315-2698827520] 07/13/2016 10:22:24 AM Closed session for
Server1/Acme Databases accessed: 1 Documents read: 2
Documents written: 0 Transactions: 7
[18741:00146-1727678208] LkMgr BEGIN Long Held Lock Dump
[18741:00146-1727678208] Lock(Mode=X * LockID(RRV
DB=/local/notesdata/mail/abc.nsf RRV=4962566)) Waiters
countNonIntentLocks = 2 countIntentLocks = 0, queuLength = 1
[18741:00146-1727678208] Req(Status=Granted Mode=X Class=Commit
Nest=0 Cnt=2
[18741:00146-1727678208] Tran=18652530 Func=N/A dbrrv.c:1211
[20119:00035-676361984])
[18741:00146-1727678208] rm_lkmgr_cpp:2125
[18741:00146-1727678208] rm_lkmgr_cpp:1358
[18741:00146-1727678208] dbrrv_c:510
[18741:00146-1727678208] ntupdate_c:2428
[18741:00146-1727678208] ntupdate_c:2321
[18741:00146-1727678208] ntupwrap_c:961
[18741:00146-1727678208] dbfolder_c:9384
[18741:00146-1727678208] foldwrap_c:367
[18741:00146-1727678208] nsfsem6_c:929
[18741:00146-1727678208] Req(Status=Waiting Mode=S Class=Short
Nest=0 Cnt=0
[18741:00146-1727678208] Tran=0 Func=N/A cursor.c:1278
[18741:03150--773032192] Delay=1min)
[18741:00146-1727678208] rm_lkmgr_cpp:2125
[18741:00146-1727678208] rm_lkmgr_cpp:1358
[18741:00146-1727678208] cursor_c:1019
[18741:00146-1727678208] nsfsem3_c:863
[18741:00146-1727678208] LkMgr END Long Held Lock Dump
After some time, the following error was seen recurring on the server console:
[1974:0002-1B24] Recovery Manager: Log File is Full
[14E8:0127-0744] 01/16/2016 04:29:44 AM Closed session for
Server1/Acme Databases accessed: 2 Documents read:
2 Documents written: 13303
**** DbMarkCorrupt(Unable to close container), DB=/local/notesdata/mail/abc.nsf
TID=[14E8:00C4-12F0] File=index\dbbuf.c Line=369 ***
[14E8:00C4-12F0] comp = 11, fnc = 10, probeid = 4, errcode = 2809,
alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
comp = 11, fnc = 10, probeid = 4, errcode = 2809, alert_num = 2
[1974:0004-09E0] comp = 11, fnc = 10, probeid = 4, errcode = 2809,
alert_num = 2
What can the customer do to restore the service ASAP?
In this case, follow these recommendations:
- Recreate log.nsf (see the example after this list).
- Run offline maintenance on the reported database during off hours, or pull a clean replica from a non-problematic server, since the long held locks may have caused corruption in that database.
- Restart the server at the OS level. Restarting the server at the Domino level alone has also resolved this issue in a few cases.
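For example, log.nsf can be recreated by renaming it while the server is down. A minimal sketch, assuming a Windows server whose data directory is D:\Domino\Data (a hypothetical path, adjust to your environment): shut the server down with quit at the server console, then at the OS command prompt run
ren D:\Domino\Data\log.nsf log_old.nsf
and restart Domino (nserver.exe, or start the Domino service). A new log.nsf is built automatically at startup.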
Run offline maintenance on the reported database with the following parameters:
nFixup -F (-j if transaction logging is enabled)
The -F option makes Fixup check all documents in the database. Without this option, Fixup checks only documents modified since Fixup last ran.
nUpdall -R -X
The -R option rebuilds all used views (-X also rebuilds full-text indexes). This is resource intensive, so use it as a last resort to solve corruption problems with a specific database.
nCompact -c
Uses copy-style compaction. Use this option to solve database corruption problems and reduce file size. NOTE: Unused (white) space is not copied over, only data and view indexes.
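As a sketch, a full maintenance pass against the database named in the log excerpts above could look like this (mail\abc.nsf is taken from the sample logs, so substitute the actual reported database; run the commands from the Domino program directory while the server is down):
nFixup mail\abc.nsf -F -j
nUpdall mail\abc.nsf -R -X
nCompact mail\abc.nsf -c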
What can be done to prevent this from recurring?
This is a known issue in Domino, documented in SPR #JPAI9N5FP7. Normally, Domino keeps track in memory of everything that has been written out to disk; in this issue, the data was written out to disk but Domino did not mark it as clean, leading to the recurring errors and eventually to Log File is Full.
This problem is reported in Domino 8.5.x and 9.0.x releases.
It is fixed in Domino 9.0.1 FP6, and a hotfix has also been provided on top of Domino 8.5.3 FP6.
In a few cases, the error was resolved after taking the corrective actions below:
1) If there are disk issues, make sure the average disk queue length of the data drive stays below 2. If the value of Avg. Disk Queue Length exceeds twice the number of spindles, you are likely developing a bottleneck. With a volume set, a queue that is never shorter than the number of active physical disks also indicates a developing bottleneck. A sample counter query is shown after the NSD excerpts below.
Refer:
https://technet.microsoft.com/en-us/library/cc938625.aspx
The NSD logs reported the following:
From NSD 1
Platform.LogicalDisk.3.AssignedName = X
Platform.LogicalDisk.3.AvgQueueLen = 20.01
Platform.LogicalDisk.3.AvgQueueLen.Avg = 1.3551822659E+13
Platform.LogicalDisk.3.AvgQueueLen.Peak = 1.8446741004E+17
Platform.LogicalDisk.3.BytesReadPerSec = 10,363,750.4
Platform.LogicalDisk.3.BytesWrittenPerSec = 2,517,321.5
Platform.LogicalDisk.3.PctUtil = 2,001.35
Platform.LogicalDisk.3.PctUtil.Avg = 1.3551599394E+13
Platform.LogicalDisk.3.PctUtil.Peak = 1.8446437095E+17
Platform.LogicalDisk.3.ReadsPerSec = 749.66
Platform.LogicalDisk.3.WritesPerSec = 114.08
From NSD 2
Platform.LogicalDisk.3.AssignedName = X
Platform.LogicalDisk.3.AvgQueueLen = 6.73
Platform.LogicalDisk.3.AvgQueueLen.Avg = 1.3514095974E+13
Platform.LogicalDisk.3.AvgQueueLen.Peak = 1.8446741004E+17
Platform.LogicalDisk.3.BytesReadPerSec = 5,924,886.85
Platform.LogicalDisk.3.BytesWrittenPerSec = 1,942,539.93
Platform.LogicalDisk.3.PctUtil = 672.56
Platform.LogicalDisk.3.PctUtil.Avg = 1.3513873330E+13
Platform.LogicalDisk.3.PctUtil.Peak = 1.8446437095E+17
Platform.LogicalDisk.3.ReadsPerSec = 238.41
Platform.LogicalDisk.3.WritesPerSec = 202.79
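To watch this counter live on a Windows server, a query such as the following can be used (a sketch only; X: is the data drive named in the statistics above, and -si 5 samples every 5 seconds):
typeperf "\LogicalDisk(X:)\Avg. Disk Queue Length" -si 5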
2) The drive holding the data has also been seen to become full. In the example below, the server is nearly out of disk space (90+ % full) on most drives. More space needs to be added, or data moved, first; without free disk space the server cannot continue to write more data, as seen below. The current free space can also be rechecked from the OS, as shown after the listing.
<@@ System Data -> Local Disks (Time 18:37:24) @@>
DRV    Total(GB)    Used(GB)    Free(GB)    % Used    Remote Connection
C:\       100.00       37.43       62.57    37.43%
F:\      1228.87     1215.34       13.53    98.90%
L:\       100.00       72.76       27.24    72.76%
W:\       250.02      231.21       18.81    92.48%
X:\      1124.14      927.34      196.80    82.49%
Y:\      1024.08      977.71       46.37    95.47%
Z:\      1126.55     1090.08       36.47    96.76%
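On a Windows server, the current free space on an affected drive can be rechecked from an OS command prompt, for example (a sketch; X: is one of the drives from the listing above):
fsutil volume diskfree X: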
3) In some cases the transaction log directory is stored under the data folder itself:
TRANSLOG_Path=logdir
We do not recommend keeping transaction logs under the data directory, as that can cause server performance issues. There should be a separate, dedicated drive for transaction logs. When transaction logs do not get flushed due to disk issues, the server starts generating the error 'Recovery Manager: Log File is Full'.
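As a sketch, pointing the transaction log at a dedicated drive means a notes.ini change such as the following (E:\translog is a hypothetical path on a separate physical disk; shut the server down before changing it, and new log extents are created on the new path at restart):
TRANSLOG_Status=1
TRANSLOG_Path=E:\translog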
4) Database corruption has often been seen prior to the start of this error message, for example:
Entry not found in index
Line 3832: Line 1277838: [1ED0:001B-1E28] 07/28/2016 05:38:14 PM
Monitor Error in Database Y:\Lotus\Domino\Data\Mail\ABC.nsf:
Entry not found in index
Line 3837: Line 1280621: [1ED0:0023-2690] 07/28/2016 05:39:50 PM
Monitor Error in Database X:\Lotus\Domino\data\mail\XYZ.nsf: Entry
not found in index
Line 3840: Line 1284393: [1ED0:0020-23E8] 07/28/2016 05:40:27 PM
Monitor Error in Database X:\Lotus\Domino\data\mail\XYZ.nsf: Entry
not found in index
Line 3843: Line 1286901: [1ED0:001F-2050] 07/28/2016 05:40:54 PM
Monitor Error in Database X:\Lotus\Domino\data\mail\XYZ.nsf: Entry
not found in index
To fix residual corruption and prevent these errors from showing up again, run offline maintenance (fixup, updall, compact) on the mail directories under the Domino data directory, as in the sketch below.
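A minimal sketch of such a directory-level run from the OS command line during off hours, assuming 'mail' is the mail subdirectory under the data directory (fixup, updall, and compact each accept a path relative to the data directory):
nFixup mail -F -j
nUpdall mail -R
nCompact mail -c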
5) In one case, the transaction logging style in use was circular and the TXN log size was configured beyond 4 GB, which exceeds the limit for circular logging, as shown below:
<@@ Directory Listings -> Transaction Log Full Listing By Time @@>
/data/translog:
total 4129024
CREATE_R9_DATABASES=1
TRANSLOG_Path=/data/translog
Previous_TRANSLOG_Path=/data/translog/
Previous_TRANSLOG_Style=0
TRANSLOG_MaxSize=10240
Previous_TRANSLOG_Status=1
Follow the steps below to resolve this error message:
A) Since circular transaction logging is in use, change the size to 4096 MB; circular transaction logging cannot go beyond 4 GB (see the sample notes.ini settings after step D).
B) Increase the console logging limit to at least 200 MB. If the console log limit is set too low, the information is overwritten and it becomes impossible to find which database went bad and caused the Log File is Full message. The parameter to add is Console_Log_Max_Kbytes=204800, which requires a server restart (also shown in the sample settings after step D).
C) Another option is to review log.nsf -> Miscellaneous Events and check the documents from just before the server started displaying the "Log File is Full" message; you may find a specific database reported as corrupt.
D) You may also create an event handler that sends an alert to a specific mail ID when the keyword "corrupt" is captured.
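As a sketch, the notes.ini changes for steps A and B would look like the following (assumed values; restart the server for them to take effect):
TRANSLOG_Style=0
TRANSLOG_MaxSize=4096
Console_Log_Enabled=1
Console_Log_Max_Kbytes=204800
Here TRANSLOG_Style=0 selects circular logging, TRANSLOG_MaxSize is in MB (4096 MB is the circular limit), and Console_Log_Max_Kbytes is in KB (204800 KB is roughly 200 MB).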